knitr::opts_chunk$set(warning=FALSE)
library(dplyr)
library(countrycode)
library(outliers)
library(caret)
library(cluster)
library(factoextra)
library(NbClust)
library("DMwR")
library("RWeka")
library("C50")
library("rpart")
library("themis")
library(rattle)
library(rpart.plot)
library(RColorBrewer)
Prediction of cyber security employees’ salaries based on 11 attributes & grouping employees based on shared characteristics.
1.work_year
2.experience_level
3.employment_type
4.job_title
5.salary
6.salary_currency
7.salary_in_usd
8.employee_residence
9.remote_ratio
10.company_location
11.company_size
We are living in the “information age” or rather the “data age”, meaning that everything around us revolves around data. The data has become one of the most valuable assets that a person or an organisation can have, since it has a significant value, losing it will lead to significant damages. Thus, most of the attacks nowadays are directed toward the data. To guard against such damages, organisations have realised the importance of protecting their digital assets, leading them to hire cybersecurity specialists. This made cybersecurity gain popularity among people so there’s a growing tendency to study cybersecurity. Consequently this resulted in the emergence of plentiful professionals with various experience levels and skills in this field. As a result, organisations may find it difficult to decide a salary for job candidates solely based on the CV. also, since the attacks improve rapidly, organisations need to hire more employees in the far future to defend against such attacks but it’s not an easy matter to predict the future payroll which may hinders some of the organisation’s plans. Another issue arises when the decision makers in the organisation aren’t fully aware of the different groups of employees and their differint needs. Their lack of awareness gives a chance for the competitor organisations to attract their employees to them by offering a better salary and privilages that match their needs.
Prediction of the cyber security employees’ salary categories (Very Low, Low, , High, Very High) using classification, and description of data characteristics and behavior and grouping data using clustering methods.
Given the problems we discussed and In order to better understand this field, we decided to analyse a dataset of 1247 cybersecurity employees, containing information such as salary, job title, and experience level. Analysing this dataset can provide insightful predictions regarding the salary range of a cybersecurity employee and description of the cybersecurity market behavior by grouping the data, which can help in:
https://www.kaggle.com/datasets/deepcontractor/cyber-security-salaries
dataset= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/salaries_cyber.csv"), header=TRUE)
View(dataset)
we will keep a copy of the original dataset before data preprocessing to use if needed at any time
originalDataset= dataset
No. of attributes: 11
Type of attributes: Ordinal , Nominal, and Numeric
No. of objects: 1247
Class label: salary_in_usd
ncol(dataset)
[1] 11
nrow(dataset)
[1] 1247
names(dataset)
[1] "work_year" "experience_level" "employment_type"
[4] "job_title" "salary" "salary_currency"
[7] "salary_in_usd" "employee_residence" "remote_ratio"
[10] "company_location" "company_size"
str(dataset)
'data.frame': 1247 obs. of 11 variables:
$ work_year : int 2022 2022 2022 2022 2022 2022 2022 2022 2021 2022 ...
$ experience_level : chr "EN" "MI" "MI" "MI" ...
$ employment_type : chr "FT" "FT" "FT" "FT" ...
$ job_title : chr "Cyber Program Manager" "Security Analyst" "Security Analyst" "IT Security Analyst" ...
$ salary : int 63000 95000 70000 250000 120000 315000 220000 140000 55000 360000 ...
$ salary_currency : chr "USD" "USD" "USD" "BRL" ...
$ salary_in_usd : int 63000 95000 70000 48853 120000 315000 220000 140000 75650 360000 ...
$ employee_residence: chr "US" "US" "US" "BR" ...
$ remote_ratio : int 50 0 0 50 100 100 100 100 50 100 ...
$ company_location : chr "US" "US" "US" "BR" ...
$ company_size : chr "S" "M" "M" "L" ...
| Attribute Name | Description | Data Type | Possible values |
|---|---|---|---|
| work_year | The year in which salary was recorded | Numerical | 2020 to 2022 |
| experience_level | Expertise level of the employee | Ordinal | En “Entry level” MI “Mid level” SE “Senior level” EX “Executive level” |
| employment_type | The nature or category of employee’s engagement in the job | Nominal | PT “Part time” FT “Full time” CT “Contract FL”Freelancer” |
| job_title | The role worked in during the year | Nominal | Different titles. like Security Analyst, security researcher |
| salary | The total gross salary amount paid | Numerical | 1740-50001566 |
| salary_currency | The currency of the salary paid to the employee | Nominal | Different currencies according to ISO 4217 currency code. like DE,CA |
| salary_in_usd | The salary paid in United states dollar | Numerical | 2000 to 365596.40 |
| employee_residence | Employee’s primary country of residence | Nominal | Different countries. like US,AE |
| remote_ratio | Percentage of online work by employee in the specified year | Numerical | 0 “No remote work” 50 “Partially remote” 100 “Fully remote” |
| company_location | The country of the employer’s main office | Nominal | Different countries. like BR,BW |
| company_size | How big/small is the company | Ordinal | S , M or L |
using sample_n(table,size) function and using (set_seed())
set.seed(30)
sample=sample_n(dataset,20)
print(sample)
if it is FALSE it means no null value,if it is TRUE there is null value. In our dataset there is no null values.
is.na(dataset)
work_year experience_level employment_type job_title salary
[1,] FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE FALSE FALSE
[21,] FALSE FALSE FALSE FALSE FALSE
[22,] FALSE FALSE FALSE FALSE FALSE
[23,] FALSE FALSE FALSE FALSE FALSE
[24,] FALSE FALSE FALSE FALSE FALSE
[25,] FALSE FALSE FALSE FALSE FALSE
[26,] FALSE FALSE FALSE FALSE FALSE
[27,] FALSE FALSE FALSE FALSE FALSE
[28,] FALSE FALSE FALSE FALSE FALSE
[29,] FALSE FALSE FALSE FALSE FALSE
[30,] FALSE FALSE FALSE FALSE FALSE
[31,] FALSE FALSE FALSE FALSE FALSE
[32,] FALSE FALSE FALSE FALSE FALSE
[33,] FALSE FALSE FALSE FALSE FALSE
[34,] FALSE FALSE FALSE FALSE FALSE
[35,] FALSE FALSE FALSE FALSE FALSE
[36,] FALSE FALSE FALSE FALSE FALSE
[37,] FALSE FALSE FALSE FALSE FALSE
[38,] FALSE FALSE FALSE FALSE FALSE
[39,] FALSE FALSE FALSE FALSE FALSE
[40,] FALSE FALSE FALSE FALSE FALSE
[41,] FALSE FALSE FALSE FALSE FALSE
[42,] FALSE FALSE FALSE FALSE FALSE
[43,] FALSE FALSE FALSE FALSE FALSE
[44,] FALSE FALSE FALSE FALSE FALSE
[45,] FALSE FALSE FALSE FALSE FALSE
[46,] FALSE FALSE FALSE FALSE FALSE
[47,] FALSE FALSE FALSE FALSE FALSE
[48,] FALSE FALSE FALSE FALSE FALSE
[49,] FALSE FALSE FALSE FALSE FALSE
[50,] FALSE FALSE FALSE FALSE FALSE
[51,] FALSE FALSE FALSE FALSE FALSE
[52,] FALSE FALSE FALSE FALSE FALSE
[53,] FALSE FALSE FALSE FALSE FALSE
[54,] FALSE FALSE FALSE FALSE FALSE
[55,] FALSE FALSE FALSE FALSE FALSE
[56,] FALSE FALSE FALSE FALSE FALSE
[57,] FALSE FALSE FALSE FALSE FALSE
[58,] FALSE FALSE FALSE FALSE FALSE
[59,] FALSE FALSE FALSE FALSE FALSE
[60,] FALSE FALSE FALSE FALSE FALSE
[61,] FALSE FALSE FALSE FALSE FALSE
[62,] FALSE FALSE FALSE FALSE FALSE
[63,] FALSE FALSE FALSE FALSE FALSE
[64,] FALSE FALSE FALSE FALSE FALSE
[65,] FALSE FALSE FALSE FALSE FALSE
[66,] FALSE FALSE FALSE FALSE FALSE
[67,] FALSE FALSE FALSE FALSE FALSE
[68,] FALSE FALSE FALSE FALSE FALSE
[69,] FALSE FALSE FALSE FALSE FALSE
[70,] FALSE FALSE FALSE FALSE FALSE
[71,] FALSE FALSE FALSE FALSE FALSE
[72,] FALSE FALSE FALSE FALSE FALSE
[73,] FALSE FALSE FALSE FALSE FALSE
[74,] FALSE FALSE FALSE FALSE FALSE
[75,] FALSE FALSE FALSE FALSE FALSE
[76,] FALSE FALSE FALSE FALSE FALSE
[77,] FALSE FALSE FALSE FALSE FALSE
[78,] FALSE FALSE FALSE FALSE FALSE
[79,] FALSE FALSE FALSE FALSE FALSE
[80,] FALSE FALSE FALSE FALSE FALSE
[81,] FALSE FALSE FALSE FALSE FALSE
[82,] FALSE FALSE FALSE FALSE FALSE
[83,] FALSE FALSE FALSE FALSE FALSE
[84,] FALSE FALSE FALSE FALSE FALSE
[85,] FALSE FALSE FALSE FALSE FALSE
[86,] FALSE FALSE FALSE FALSE FALSE
[87,] FALSE FALSE FALSE FALSE FALSE
[88,] FALSE FALSE FALSE FALSE FALSE
[89,] FALSE FALSE FALSE FALSE FALSE
[90,] FALSE FALSE FALSE FALSE FALSE
salary_currency salary_in_usd employee_residence remote_ratio
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE FALSE
[21,] FALSE FALSE FALSE FALSE
[22,] FALSE FALSE FALSE FALSE
[23,] FALSE FALSE FALSE FALSE
[24,] FALSE FALSE FALSE FALSE
[25,] FALSE FALSE FALSE FALSE
[26,] FALSE FALSE FALSE FALSE
[27,] FALSE FALSE FALSE FALSE
[28,] FALSE FALSE FALSE FALSE
[29,] FALSE FALSE FALSE FALSE
[30,] FALSE FALSE FALSE FALSE
[31,] FALSE FALSE FALSE FALSE
[32,] FALSE FALSE FALSE FALSE
[33,] FALSE FALSE FALSE FALSE
[34,] FALSE FALSE FALSE FALSE
[35,] FALSE FALSE FALSE FALSE
[36,] FALSE FALSE FALSE FALSE
[37,] FALSE FALSE FALSE FALSE
[38,] FALSE FALSE FALSE FALSE
[39,] FALSE FALSE FALSE FALSE
[40,] FALSE FALSE FALSE FALSE
[41,] FALSE FALSE FALSE FALSE
[42,] FALSE FALSE FALSE FALSE
[43,] FALSE FALSE FALSE FALSE
[44,] FALSE FALSE FALSE FALSE
[45,] FALSE FALSE FALSE FALSE
[46,] FALSE FALSE FALSE FALSE
[47,] FALSE FALSE FALSE FALSE
[48,] FALSE FALSE FALSE FALSE
[49,] FALSE FALSE FALSE FALSE
[50,] FALSE FALSE FALSE FALSE
[51,] FALSE FALSE FALSE FALSE
[52,] FALSE FALSE FALSE FALSE
[53,] FALSE FALSE FALSE FALSE
[54,] FALSE FALSE FALSE FALSE
[55,] FALSE FALSE FALSE FALSE
[56,] FALSE FALSE FALSE FALSE
[57,] FALSE FALSE FALSE FALSE
[58,] FALSE FALSE FALSE FALSE
[59,] FALSE FALSE FALSE FALSE
[60,] FALSE FALSE FALSE FALSE
[61,] FALSE FALSE FALSE FALSE
[62,] FALSE FALSE FALSE FALSE
[63,] FALSE FALSE FALSE FALSE
[64,] FALSE FALSE FALSE FALSE
[65,] FALSE FALSE FALSE FALSE
[66,] FALSE FALSE FALSE FALSE
[67,] FALSE FALSE FALSE FALSE
[68,] FALSE FALSE FALSE FALSE
[69,] FALSE FALSE FALSE FALSE
[70,] FALSE FALSE FALSE FALSE
[71,] FALSE FALSE FALSE FALSE
[72,] FALSE FALSE FALSE FALSE
[73,] FALSE FALSE FALSE FALSE
[74,] FALSE FALSE FALSE FALSE
[75,] FALSE FALSE FALSE FALSE
[76,] FALSE FALSE FALSE FALSE
[77,] FALSE FALSE FALSE FALSE
[78,] FALSE FALSE FALSE FALSE
[79,] FALSE FALSE FALSE FALSE
[80,] FALSE FALSE FALSE FALSE
[81,] FALSE FALSE FALSE FALSE
[82,] FALSE FALSE FALSE FALSE
[83,] FALSE FALSE FALSE FALSE
[84,] FALSE FALSE FALSE FALSE
[85,] FALSE FALSE FALSE FALSE
[86,] FALSE FALSE FALSE FALSE
[87,] FALSE FALSE FALSE FALSE
[88,] FALSE FALSE FALSE FALSE
[89,] FALSE FALSE FALSE FALSE
[90,] FALSE FALSE FALSE FALSE
company_location company_size
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] FALSE FALSE
[6,] FALSE FALSE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
[10,] FALSE FALSE
[11,] FALSE FALSE
[12,] FALSE FALSE
[13,] FALSE FALSE
[14,] FALSE FALSE
[15,] FALSE FALSE
[16,] FALSE FALSE
[17,] FALSE FALSE
[18,] FALSE FALSE
[19,] FALSE FALSE
[20,] FALSE FALSE
[21,] FALSE FALSE
[22,] FALSE FALSE
[23,] FALSE FALSE
[24,] FALSE FALSE
[25,] FALSE FALSE
[26,] FALSE FALSE
[27,] FALSE FALSE
[28,] FALSE FALSE
[29,] FALSE FALSE
[30,] FALSE FALSE
[31,] FALSE FALSE
[32,] FALSE FALSE
[33,] FALSE FALSE
[34,] FALSE FALSE
[35,] FALSE FALSE
[36,] FALSE FALSE
[37,] FALSE FALSE
[38,] FALSE FALSE
[39,] FALSE FALSE
[40,] FALSE FALSE
[41,] FALSE FALSE
[42,] FALSE FALSE
[43,] FALSE FALSE
[44,] FALSE FALSE
[45,] FALSE FALSE
[46,] FALSE FALSE
[47,] FALSE FALSE
[48,] FALSE FALSE
[49,] FALSE FALSE
[50,] FALSE FALSE
[51,] FALSE FALSE
[52,] FALSE FALSE
[53,] FALSE FALSE
[54,] FALSE FALSE
[55,] FALSE FALSE
[56,] FALSE FALSE
[57,] FALSE FALSE
[58,] FALSE FALSE
[59,] FALSE FALSE
[60,] FALSE FALSE
[61,] FALSE FALSE
[62,] FALSE FALSE
[63,] FALSE FALSE
[64,] FALSE FALSE
[65,] FALSE FALSE
[66,] FALSE FALSE
[67,] FALSE FALSE
[68,] FALSE FALSE
[69,] FALSE FALSE
[70,] FALSE FALSE
[71,] FALSE FALSE
[72,] FALSE FALSE
[73,] FALSE FALSE
[74,] FALSE FALSE
[75,] FALSE FALSE
[76,] FALSE FALSE
[77,] FALSE FALSE
[78,] FALSE FALSE
[79,] FALSE FALSE
[80,] FALSE FALSE
[81,] FALSE FALSE
[82,] FALSE FALSE
[83,] FALSE FALSE
[84,] FALSE FALSE
[85,] FALSE FALSE
[86,] FALSE FALSE
[87,] FALSE FALSE
[88,] FALSE FALSE
[89,] FALSE FALSE
[90,] FALSE FALSE
[ reached getOption("max.print") -- omitted 1157 rows ]
sum(is.na(dataset))
[1] 0
The summary statistics for the dataset variables provide insights into the distribution of features. we can conclude the following:
In work_year: The data spans from the year 2020 to 2022 with Most data falling within the years 2021 and 2022, as indicated by both the median and mean being centered around 2021.
In salary: Salaries vary widely with a minimum of $1,740 and a maximum of $500 million. The median is $120,000 which is a mid value, but the mean is notably higher at $560,852 which might be duo to extreme values or notable skewness.
In salary_in_usd: The data has a median of $110,000, and a mean of $120,278, and the spread of salaries is observable in the difference between the median and mean.
In remote_ratio: Indicates the percentage of remote work ranging from 0% to 100%, with a median and 3rd quartile at 100%, and a mean of 71.49%, indicating a notable presence of remote work in the dataset, suggesting some variability.
summary(dataset$work_year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2020 2021 2021 2021 2022 2022
summary(dataset$salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1740 79754 120000 560852 160080 500000000
summary(dataset$salary_in_usd)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2000 74595 110000 120278 150000 910991
summary(dataset$remote_ratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 50.00 100.00 71.49 100.00 100.00
variance is to understand the spread or dispersion of the values in each column. A higher variance indicates that the values are more spread out from the mean and in our dataset the highest varied attribute is salary, while a lower variance indicates that the values are closer to the mean which in our datas it is work year attribute.
Variance results reveal that: -work years are to some extent consistent -salaries show notable variability and possible outliers -salaries in USD have a stable distribution -remote work ratio have moderate variability
var(dataset$work_year)
[1] 0.511942
var(dataset$salary)
[1] 2.004897e+14
var(dataset$salary_in_usd)
[1] 4940880203
var(dataset$remote_ratio)
[1] 1548.175
Here we used boxplot to see the distribution between salary_in_usd and experience_level We observed that salaries vary depending on the level of experience,they are positively correlated.
boxplot(salary_in_usd ~ experience_level, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
Here we used boxplot to see the distribution between salary_in_usd and work_year We observed that 2021 salaries were close to each other but in 2022 the gap between them getting bigger.
boxplot(salary_in_usd ~ work_year, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
Here we used boxplot to see the distribution between salary_in_usd and employment_type We observed that Full Time (FT) offers more salary than the other categories.
boxplot(salary_in_usd ~ employment_type, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
Here we used boxplot to see the distribution between salary_in_usd and company_size We observed that the larger the company is the higher the salary was.
boxplot(salary_in_usd ~ company_size, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
The “salary” column gives the same information as “salary_in_usd” it’s just a matter of currency exchange, and we will eventually transform all the values in “salary” column to one common currency so we can properly deal with them. To further confirm that the two column are redundant, we will use the latest exchange rate for USD to the desired currency.
we will start by creating a temporary column named “converted_salary” to save the salary that we will get by using the exchange rate to convert the salary_in_usd to the salary with different currencies to compare with “salary” column
convertedDataset=dataset
convertedDataset$exchange_rate = factor(convertedDataset$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1/1,1/0.20,1/1.22,1/1.06,1/0.012,1/0.74,1/1.10,1/0.14,1/0.73,1/0.64,1/0.090,1/0.057,1/0.26,1/0.23,1/0.093,1/0.000065,1/0.60,1/0.0027,1/0.053,1/0.031,1/0.010))
convertedDataset$exchange_rate = as.numeric(as.character(convertedDataset$exchange_rate))
convertedDataset$converted_salary = convertedDataset$salary_in_usd*convertedDataset$exchange_rate
set.seed(1)
salary_sample <- sample_n(convertedDataset[,c("salary","converted_salary")],10)
print(salary_sample)
as shown in the sample, the two columns are almost identical. This can be proved by correlation coefficient as well.
correlation <- cor(convertedDataset$salary , convertedDataset$converted_salary)
print(correlation)
[1] 0.9999943
The correlation is so high but it hasn’t reached 100% possibly due to rounding in the calculations and slight differences in the exchange rate over time.
To make the mining process more effiecent and has an improved quality, we decided to remove the “salary” column.
dataset<-dataset[,-c(5)]
We will show outliers with boxPlots and then remove them, to minimize noise and to get better analytical results when applying data mining techniques.
now we show (salary_in_usd) attributes’ outliers. we can see that there are many outliers with exceptionally high values, thus we will remove them.
boxplot(dataset$salary_in_usd)
OutSalary = outlier(dataset$salary_in_usd, logical =TRUE)
Find_outlier = which(OutSalary ==TRUE, arr.ind = TRUE)
dataset= dataset[-Find_outlier,]
now we show (remote_ratio) attributes’ outliers. we can see there aren’t outliers in remote_ratio, thus we don’t need the last step i.e: removing outliers’ rows.
boxplot(dataset$remote_ratio)
now we show (work_year) attributes’ outliers. we can see there aren’t outliers in work_year, thus we don’t need the last step i.e: removing outliers’ rows.
boxplot(dataset$work_year)
the columns “company_location” and “employee_residence” have the name of countries for the company and employee respectively. And these attributes can be generalized to higher-level concept that is region to help understand and analyze the dataset better and improve algorithm performance.
We will use the 7 regions as defined in the World Bank Development Indicators. These regions are:
East Asia and Pacific: This region includes countries like China, Australia, Indonesia, Thailand, etc.
Europe and Central Asia: This region includes countries like Germany, UK, Russia, Turkey, etc.
Latin America & Caribbean: This region includes countries like Brazil, Mexico, Argentina, Cuba, etc.
Middle East and North Africa: This region includes countries like Saudi Arabia, Egypt, Iran, Iraq, etc.
North America: This is predominantly United States and Canada.
South Asia: This region includes countries like India, Pakistan, Bangladesh, Sri Lanka, etc.
Sub-Saharan Africa: This region includes countries like Nigeria, South Africa, Ethiopia, Kenya, etc.
Note: UM(The United States Minor Outlying Islands) and AQ(Antarctica) don’t belong to any of these regions, thus, they will be used as they are.
um=which(dataset$company_location=="UM")
aq=which(dataset$company_location=="AQ")
dataset$company_location <- countrycode(dataset$company_location, "iso2c", "region")
Warning: Some values were not matched unambiguously: AQ, UM
dataset$employee_residence <- countrycode(dataset$employee_residence, "iso2c", "region")
dataset[um,"company_location"]="UM"
dataset[aq,"company_location"]="AQ"
Concept hierarchy generation can be done for “job_title” as well to improve interpretation and scalability. Also, most job titles are essentially the same job but with different names, so we can combine them into a higher-level jobs titles such as Architect, Analyst and Engineer etc.
## Create the categories based on job rank
dataset$job_title <- ifelse(grepl("Analyst", dataset$job_title), "Analyst",
ifelse(grepl("Architect", dataset$job_title), "Architect",
ifelse(grepl("Engineer", dataset$job_title), "Engineer",
ifelse(grepl("Manager|Officer|Director|Leader", dataset$job_title), "Leadership",
ifelse(grepl("Consultant|Specialist", dataset$job_title), "Consultant/Specialist",
ifelse(grepl("Cyber", dataset$job_title), "Cyber Security",
"Others"))))))
To deal with columns with character type we are going to encode them, because most machine learning algorithms are designed to work with factors data rather than character data and it improves performance and Interpretability of data as well.
dataset$job_title <- factor(dataset$job_title)
dataset$experience_level = factor(dataset$experience_level, levels=c("EN", "MI", "SE", "EX"), labels=c(1,2,3,4))
dataset$employment_type <- factor(dataset$employment_type)
dataset$employee_residence <- factor(dataset$employee_residence)
dataset$company_location <- factor(dataset$company_location)
dataset$salary_currency <- factor(dataset$salary_currency)
dataset$job_title <- factor(dataset$job_title)
dataset$company_size = factor(dataset$company_size, levels=c("S","M","L"), labels=c(1,2,3))
dataset$job_title <- factor(dataset$job_title)
by calculating breaks based on quartiles
breaks <- quantile(dataset$salary_in_usd,
probs = c(0, .25, .5, .75, .95, 1),
na.rm = TRUE)
dataset$salary_in_usd <- cut(dataset$salary_in_usd,
breaks = breaks,
include.lowest = TRUE,
labels=c("Very Low", "Low", "Medium", "High", "Very High"))
to change the scale of numeric attributes (remote_ratio and work_year) to a scale of [-1,1] to give them equal weight
dataset [, c("work_year" , "remote_ratio")] = scale(dataset [, c("work_year" , "remote_ratio")])
we will implement feature selection to remove redundant or irrelevant attributes from the data set to get the smallest subset that can help us get the most accurate predictions for our target class(salary_in_usd) and decrease the time that it takes the classifier to process the data.
we will use RFE(Recursive feature elimination) which is a wrapper method for the feature selection. Since the RFE function have multiple control options we need to specify the options that we want. We will choose “Random Forest” because it has high accuracy, can handle categorical data.
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
number = 10)
First we save the features to be used in the feature selection(every attributes except the class label “salary_in_usd”) in variable x, and the class label in variable y. Then split the data to 80% training and 20% test.
x <- dataset %>%
select(-salary_in_usd) %>%
as.data.frame()
# Target variable
y <- dataset$salary_in_usd
# Training: 80%; Test: 20%
set.seed(2021)
inTrain <- createDataPartition(y, p = .80, list = FALSE)[,1]
x_train <- x[ inTrain, ]
x_test <- x[-inTrain, ]
y_train <- y[ inTrain]
y_test <- y[-inTrain]
after splitting the data, now we can perform the selection using rfe
set.seed(1)
result_rfe1 <- rfe(x = x_train,
y = y_train,
sizes = c(1:9),
rfeControl = control)
result_rfe1
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
The top 5 variables (out of 8):
experience_level, employee_residence, salary_currency, company_location, work_year
predictors(result_rfe1)
[1] "experience_level" "employee_residence" "salary_currency"
[4] "company_location" "work_year" "job_title"
[7] "company_size" "remote_ratio"
The results show that all the remaining attributes, except for “employment_type”, are selected. This is logical, as 98% of the rows have the value “FT”, as shown in the table below. Due to the low variance, we decided to remove this attribute.
table(dataset$employment_type)
CT FL FT PT
11 3 1224 8
dataset<-dataset[,-which(names(dataset)=="employment_type")]
During this phase, our focus will be on clustering and classification techniques to analyze the data. The primary objectives are to identify distinct groups within the dataset through clustering, classify data objects into meaningful categories, and apply different evaluation methods to assess the accuracy and precision of both classification and clustering results. We aim to gain deeper insights into the data and discover patterns.
# Read the CSV file from github
dataset2= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/preprocessedDataset.csv"), header=TRUE)
# Identify the character variables in the dataset2
char_vars <- sapply(dataset2, is.character)
# Convert the identified character variables in dataset2 to factors
dataset2[char_vars] <- lapply(dataset2[char_vars], as.factor)
To resolve the problem of class imbalance in the dataset, we will use SMOTE() method that oversample the minority class by creating synthetic samples using the existing minority class samples
set.seed(10)
balanced_dataset <- SMOTE(salary_in_usd ~ ., dataset2, perc.over = 300, perc.under=500, k = 10)
The goal of all preceding steps is to properly prepare the dataset for the classification and later clustering phase, which constitutes one of our primary mining objectives. In this section, we will employ various attribute selection methods such as the Gini index, Gain ratio, and information gain to construct a decision tree model. We will thoroughly evaluate its performance, and if it proves effective, it can subsequently be utilized to classify new instances with unknown class labels.
since our dataset is small, we decided to use K-fold Cross-validation as a dataset partioning method. for each attribute selection method we will try different K size (10,5, and 3)
in all this section we will be using train and trainControl functions of caret package to produce decision trees. for Gini index the method will be “rpart” and for Gain ratio it’s “j48” as for information gain the method is “C5.0”.
the following function will be used to compute average sensitivity and Specificity:
macro = function(matrix){
sumSen=0
for (i in 1:5) {
sumSen = sumSen + matrix$byClass[i,1]
}
avgSen = sumSen/5
sumSpec=0
for (i in 1:5) {
sumSpec = sumSpec + matrix$byClass[i,2]
}
avgSpec = sumSpec/5
sumPrec=0
for (i in 1:5) {
sumPrec = sumPrec + matrix$byClass[i,3]
}
avgPrec = sumPrec/5
avgs = data.frame(Sensitivity=avgSen , Specificity=avgSpec, Precision=avgPrec ,Accuracy= unname( matrix$overall[1]) )
print(avgs)
}
Gini index measures the impurity of the dataset. The partitioning that yields the most substantial reduction in impurity is selected as the splitting attribute. To apply the Gini index, we will employ the “rpart” method, which utilizes the Gini index as the criteria for splitting.
The tree of the gini index using 10 folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))
giniIndex10 <- train(
salary_in_usd ~ .,
data = balanced_dataset,
method = "rpart",
trControl = ctrl,tuneGrid=tuneGrid,
control = list(
minsplit = 10,
minbucket = 5,
xval = 10,
cp = 0.0001
)
)
prp(giniIndex10$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)
the “experince level” attribute was selected as the first splitting attribute meaning that it has the largest impurity reduction.
The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset
giniIndex10cm = caret::confusionMatrix(giniIndex10$pred$obs,giniIndex10$pred$pred)
giniIndex10cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 99 16 37 19 3
Low 28 130 42 1 69
Medium 56 64 88 15 14
Very_High 28 5 19 190 10
Very_Low 6 54 6 9 189
Overall Statistics
Accuracy : 0.5815
95% CI : (0.5529, 0.6096)
No Information Rate : 0.2381
P-Value [Acc > NIR] : < 0.0000000000000002
Kappa : 0.4752
Mcnemar's Test P-Value : 0.01206
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.45622 0.4833 0.45833
Specificity 0.92347 0.8491 0.85174
Pos Pred Value 0.56897 0.4815 0.37131
Neg Pred Value 0.88465 0.8501 0.89167
Prevalence 0.18129 0.2247 0.16040
Detection Rate 0.08271 0.1086 0.07352
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.68985 0.6662 0.65504
Class: Very_High Class: Very_Low
Sensitivity 0.8120 0.6632
Specificity 0.9356 0.9178
Pos Pred Value 0.7540 0.7159
Neg Pred Value 0.9534 0.8971
Prevalence 0.1955 0.2381
Detection Rate 0.1587 0.1579
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8738 0.7905
the metrics shown for each class indicate the value of that metric when treating this class as the positive class and the other classes as the negative class. here the classifier showed best performance when using the “very high” class as the positive class but this value in its own doesn’t hold much value since all classes should be taken into consideration.
The tree of the gini index using 5 folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))
giniIndex5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart", trControl = ctrl,tuneGrid=tuneGrid,
control = list(
minsplit = 10,
minbucket = 5,
xval = 10,
cp = 0.0001
))
prp(giniIndex5$finalModel, box.palette = "Reds", tweak = 1.5, varlen = 10, cex = 0.15)
NA
NA
this tree has the same structure as the previous tree that used 10 folds. so in this tree as well “experience level” was choose as the first splitting attribute
The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset
giniIndex5cm = caret::confusionMatrix(giniIndex5$pred$obs,giniIndex5$pred$pred)
giniIndex5cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 98 10 43 16 7
Low 28 123 48 0 71
Medium 64 56 87 14 16
Very_High 25 7 22 192 6
Very_Low 4 53 10 7 190
Overall Statistics
Accuracy : 0.5764
95% CI : (0.5479, 0.6046)
No Information Rate : 0.2423
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.4692
Mcnemar's Test P-Value : 0.001289
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.44749 0.4940 0.41429
Specificity 0.92229 0.8449 0.84802
Pos Pred Value 0.56322 0.4556 0.36709
Neg Pred Value 0.88172 0.8641 0.87188
Prevalence 0.18296 0.2080 0.17544
Detection Rate 0.08187 0.1028 0.07268
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.68489 0.6695 0.63116
Class: Very_High Class: Very_Low
Sensitivity 0.8384 0.6552
Specificity 0.9380 0.9184
Pos Pred Value 0.7619 0.7197
Neg Pred Value 0.9608 0.8928
Prevalence 0.1913 0.2423
Detection Rate 0.1604 0.1587
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8882 0.7868
the results are very close to the 10 folds tree, so here as well the classifier shows better performance when dealing with the “very high”
The tree of the gini index using 3 folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))
giniIndex3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart", trControl = ctrl,tuneGrid=tuneGrid,
control = list(
minsplit = 10,
minbucket = 5,
xval = 10,
cp = 0.0001
))
prp(giniIndex3$finalModel, box.palette = "Reds", tweak = 1.5, varlen = 10, cex = 0.15)
NA
NA
The tree shows similar structure as the two previous two trees, whether it’s in its choose of the splitting attributes or the leaves.
The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset
giniIndex3cm = caret::confusionMatrix(giniIndex3$pred$obs,giniIndex3$pred$pred)
giniIndex3cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 99 9 39 19 8
Low 26 114 48 9 73
Medium 62 46 86 20 23
Very_High 31 6 20 189 6
Very_Low 8 50 9 14 183
Overall Statistics
Accuracy : 0.5606
95% CI : (0.5319, 0.5889)
No Information Rate : 0.2448
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.4498
Mcnemar's Test P-Value : 0.0006718
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.43805 0.50667 0.42574
Specificity 0.92276 0.83951 0.84824
Pos Pred Value 0.56897 0.42222 0.36287
Neg Pred Value 0.87586 0.88026 0.87917
Prevalence 0.18881 0.18797 0.16876
Detection Rate 0.08271 0.09524 0.07185
Detection Prevalence 0.14536 0.22556 0.19799
Balanced Accuracy 0.68041 0.67309 0.63699
Class: Very_High Class: Very_Low
Sensitivity 0.7530 0.6246
Specificity 0.9334 0.9104
Pos Pred Value 0.7500 0.6932
Neg Pred Value 0.9344 0.8821
Prevalence 0.2097 0.2448
Detection Rate 0.1579 0.1529
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8432 0.7675
here as well the “very high” class has the best overall performance
All three trees seem to be alike in their arrangement and form.
rbind("10 Folds"=macro(giniIndex10cm), "5 Folds"=macro(giniIndex5cm), "3 Folds"=macro(giniIndex3cm) )
The higher values for sensitivity, specificity, precision, and accuracy in the 10-fold case indicate better overall performance according to these metrics. so,Gini Index model performs better with a 10-fold cross-validation compared to 5 and 3 folds.
The gain ratio, a normalized measure of information gain, is calculated by dividing information gain by the split information. The attribute that yields the highest gain ratio is chosen as the splitting attribute. The C4.5 algorithm employs the gain ratio.
The J48 is the Java-based open-source implementation of the C4.5 algorithm, and it is included in the Weka package. This implementation allows users to conveniently apply the C4.5 decision tree.
The tree of the gain ratio using 10 folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
gainRatio10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio10$finalModel)
the first splitting attribute that was choosen is the “Expeirence level” attribute meaning that it probably has a high information gain and low splitInfo(Entropy of distribution of tuples into partition)
The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset
gainRatio10cm = caret::confusionMatrix(gainRatio10$pred$obs, gainRatio10$pred$pred)
gainRatio10cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 100 18 32 21 3
Low 28 148 40 3 51
Medium 53 49 115 14 6
Very_High 28 6 18 194 6
Very_Low 2 39 11 5 207
Overall Statistics
Accuracy : 0.6383
95% CI : (0.6103, 0.6655)
No Information Rate : 0.2281
P-Value [Acc > NIR] : <0.0000000000000002
Kappa : 0.5465
Mcnemar's Test P-Value : 0.167
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.47393 0.5692 0.53241
Specificity 0.92495 0.8698 0.87564
Pos Pred Value 0.57471 0.5481 0.48523
Neg Pred Value 0.89150 0.8792 0.89479
Prevalence 0.17627 0.2172 0.18045
Detection Rate 0.08354 0.1236 0.09607
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.69944 0.7195 0.70402
Class: Very_High Class: Very_Low
Sensitivity 0.8186 0.7582
Specificity 0.9396 0.9383
Pos Pred Value 0.7698 0.7841
Neg Pred Value 0.9545 0.9293
Prevalence 0.1980 0.2281
Detection Rate 0.1621 0.1729
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8791 0.8483
here the classifier shows better performance when treating “very high” and “very low” attributes as positive class. since the “very high” class is better in Sensitivity and “very low” is better in Specificity and precision (Pos Pred Value)
The tree of the gain ratio using 5 folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
gainRatio5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio5$finalModel)
the tree is similar to the tree that was resulted from 10 folds. it has choose “Experience level” as the first splitting attribute and and seem to show similar behavior.
The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset
gainRatio5cm=caret::confusionMatrix(gainRatio5$pred$obs, gainRatio5$pred$pred)
gainRatio5cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 102 21 34 16 1
Low 31 148 38 1 52
Medium 56 46 103 15 17
Very_High 30 7 18 194 3
Very_Low 3 42 10 9 200
Overall Statistics
Accuracy : 0.6241
95% CI : (0.5959, 0.6516)
No Information Rate : 0.2281
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.5289
Mcnemar's Test P-Value : 0.007667
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.45946 0.5606 0.50739
Specificity 0.92615 0.8692 0.86519
Pos Pred Value 0.58621 0.5481 0.43460
Neg Pred Value 0.88270 0.8749 0.89583
Prevalence 0.18546 0.2206 0.16959
Detection Rate 0.08521 0.1236 0.08605
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.69281 0.7149 0.68629
Class: Very_High Class: Very_Low
Sensitivity 0.8255 0.7326
Specificity 0.9397 0.9307
Pos Pred Value 0.7698 0.7576
Neg Pred Value 0.9566 0.9218
Prevalence 0.1963 0.2281
Detection Rate 0.1621 0.1671
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8826 0.8317
unlike the 10 folds, here the classifier has the best overall performance when considering the “very high” as the positive class.
The tree of the gain ratio using 3 folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
gainRatio3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio3$finalModel)
the tree shows similar behavior as the previous 2 trees that resulted from using 10 and 5 folds.
The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset
gainRatio3cm=caret::confusionMatrix(gainRatio3$pred$obs, gainRatio3$pred$pred)
gainRatio3cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 94 18 39 19 4
Low 25 129 39 3 74
Medium 47 47 110 19 14
Very_High 14 4 27 200 7
Very_Low 3 32 9 12 208
Overall Statistics
Accuracy : 0.619
95% CI : (0.5909, 0.6467)
No Information Rate : 0.2565
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.5216
Mcnemar's Test P-Value : 0.007322
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.51366 0.5609 0.4911
Specificity 0.92110 0.8542 0.8695
Pos Pred Value 0.54023 0.4778 0.4641
Neg Pred Value 0.91300 0.8910 0.8812
Prevalence 0.15288 0.1921 0.1871
Detection Rate 0.07853 0.1078 0.0919
Detection Prevalence 0.14536 0.2256 0.1980
Balanced Accuracy 0.71738 0.7075 0.6803
Class: Very_High Class: Very_Low
Sensitivity 0.7905 0.6775
Specificity 0.9449 0.9371
Pos Pred Value 0.7937 0.7879
Neg Pred Value 0.9439 0.8939
Prevalence 0.2114 0.2565
Detection Rate 0.1671 0.1738
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8677 0.8073
similar to teh 5 folds, the “very high” class has the best metrics.
The observed structure of the three decision trees seems to be the same and it can be summarized as follows:
The decision tree continues to select the most appropriate attributes for splitting at each node, progressively refining the decision process until it reaches the leaves, where final class labels are assigned to the instances.
rbind("10 Folds"=macro(gainRatio10cm), "5 Folds"=macro(gainRatio5cm), "3 Folds"=macro(gainRatio3cm) )
Based on the evaluation metrics of average Sensitivity,Precision ,Specificity, and Accuracy, it is evident that the gain ratio model, built using a 10-fold cross-validation approach, exhibits superior performance compared to the other two models. However, it’s worth noting that the difference in performance between the models is relatively small.
A detailed examination of the results from the 10-fold cross-validation reveals that the model has a notably high specificity compared to other metrics. This high specificity suggests that the model is particularly effective at correctly identifying instances that do not pertain to the target class—essentially, it accurately recognizes when examples are not members of the specified class. For example, if the positive class in question is “High” then the model is able to correctly classify tuples that belong to “Very Low”, “Medium”, and “Very High”.
However, possessing high specificity alone does not guarantee the overall effectiveness of the model, as a well-rounded model also requires balanced performance across other metrics. In this case, its ability to capture and classify instances that do belong to the positive class (as measured by sensitivity) is not as robust. For a model to be considered truly effective, it would need to demonstrate strong performance in all metrics specificity and sensitivity, ensuring it can accurately distinguish both negative and positive instances as well as accuracy precision.
Information Gain is a metric used to decide which attribute to choose for splitting the data at each node in the decision tree. For a given dataset, the Information Gain of an attribute is calculated by comparing the entropy before and after the dataset is split based on that attribute. The attribute with the highest Information Gain is chosen as the splitting attribute.
The tree of the information gain using 10 folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
infoGain10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
Warning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 4 for this object. Predictions generated using 4 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 4 for this object. Predictions generated using 4 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 4 for this object. Predictions generated using 4 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 3 for this object. Predictions generated using 3 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trials
c5model <- C5.0(salary_in_usd ~ .,
data = balanced_dataset,
trials = infoGain10$bestTune$trials,
rules = FALSE,
control = C5.0Control(winnow = infoGain10$bestTune$winnow))
plot(c5model)
from the tree, the “experince level” attribute was the first selected splitting attribute meaning that it has the highest information gain among all attributes
The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset
infoGain10cm= caret::confusionMatrix(infoGain10$pred$obs, infoGain10$pred$pred)
infoGain10cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 90 19 40 20 5
Low 24 127 42 10 67
Medium 51 58 96 19 13
Very_High 17 3 15 208 9
Very_Low 3 43 4 10 204
Overall Statistics
Accuracy : 0.6057
95% CI : (0.5773, 0.6335)
No Information Rate : 0.249
P-Value [Acc > NIR] : < 0.0000000000000002
Kappa : 0.5046
Mcnemar's Test P-Value : 0.03427
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.48649 0.5080 0.4873
Specificity 0.91700 0.8490 0.8590
Pos Pred Value 0.51724 0.4704 0.4051
Neg Pred Value 0.90714 0.8673 0.8948
Prevalence 0.15455 0.2089 0.1646
Detection Rate 0.07519 0.1061 0.0802
Detection Prevalence 0.14536 0.2256 0.1980
Balanced Accuracy 0.70174 0.6785 0.6732
Class: Very_High Class: Very_Low
Sensitivity 0.7790 0.6846
Specificity 0.9527 0.9333
Pos Pred Value 0.8254 0.7727
Neg Pred Value 0.9376 0.8992
Prevalence 0.2231 0.2490
Detection Rate 0.1738 0.1704
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8659 0.8089
similar to the trees from the gini index and gain ratio, the classifier seem to have better performance when treating the “very high” class as the positive class
The tree of the information gain using 5 folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
infoGain5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
Warning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 8 for this object. Predictions generated using 8 trials
c5model <- C5.0(salary_in_usd ~ .,
data = balanced_dataset,
trials = infoGain5$bestTune$trials,
rules = FALSE,
control = C5.0Control(winnow = infoGain5$bestTune$winnow))
plot(c5model)
the tree has similar behavior as the 10 folds information gain tree
The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset
infoGain5cm = caret::confusionMatrix(infoGain5$pred$obs, infoGain5$pred$pred)
infoGain5cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 85 20 47 16 6
Low 25 129 38 10 68
Medium 37 70 99 14 17
Very_High 16 4 13 210 9
Very_Low 2 37 9 13 203
Overall Statistics
Accuracy : 0.6065
95% CI : (0.5782, 0.6343)
No Information Rate : 0.2531
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.5049
Mcnemar's Test P-Value : 0.001691
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.51515 0.4962 0.48058
Specificity 0.91376 0.8495 0.86075
Pos Pred Value 0.48851 0.4778 0.41772
Neg Pred Value 0.92180 0.8587 0.88854
Prevalence 0.13784 0.2172 0.17210
Detection Rate 0.07101 0.1078 0.08271
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.71446 0.6728 0.67066
Class: Very_High Class: Very_Low
Sensitivity 0.7985 0.6700
Specificity 0.9550 0.9318
Pos Pred Value 0.8333 0.7689
Neg Pred Value 0.9439 0.8928
Prevalence 0.2197 0.2531
Detection Rate 0.1754 0.1696
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8768 0.8009
the classifier shows very close performance to the 10 folds information gain model
The tree of the information gain using 3 folds
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
infoGain3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)
Warning: 'trials' should be <= 4 for this object. Predictions generated using 4 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 5 for this object. Predictions generated using 5 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 7 for this object. Predictions generated using 7 trialsWarning: 'trials' should be <= 9 for this object. Predictions generated using 9 trialsWarning: 'trials' should be <= 6 for this object. Predictions generated using 6 trials
c5model <- C5.0(salary_in_usd ~ .,
data = balanced_dataset,
trials = infoGain3$bestTune$trials,
rules = FALSE,
control = C5.0Control(winnow = infoGain3$bestTune$winnow))
plot(c5model)
the visisble parts of the tree seem to behave the same as the prevoius 2 fold sizes- 10 and 5.
The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset
infoGain3cm = caret::confusionMatrix(infoGain3$pred$obs, infoGain3$pred$pred)
infoGain3cm
Confusion Matrix and Statistics
Reference
Prediction High Low Medium Very_High Very_Low
High 85 25 40 19 5
Low 19 137 37 5 72
Medium 55 64 93 11 14
Very_High 16 8 22 197 9
Very_Low 4 43 11 9 197
Overall Statistics
Accuracy : 0.5923
95% CI : (0.5639, 0.6203)
No Information Rate : 0.2481
P-Value [Acc > NIR] : < 0.0000000000000002
Kappa : 0.4874
Mcnemar's Test P-Value : 0.01149
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.47486 0.4946 0.45813
Specificity 0.91257 0.8554 0.85513
Pos Pred Value 0.48851 0.5074 0.39241
Neg Pred Value 0.90811 0.8490 0.88542
Prevalence 0.14954 0.2314 0.16959
Detection Rate 0.07101 0.1145 0.07769
Detection Prevalence 0.14536 0.2256 0.19799
Balanced Accuracy 0.69372 0.6750 0.65663
Class: Very_High Class: Very_Low
Sensitivity 0.8174 0.6633
Specificity 0.9425 0.9256
Pos Pred Value 0.7817 0.7462
Neg Pred Value 0.9534 0.8928
Prevalence 0.2013 0.2481
Detection Rate 0.1646 0.1646
Detection Prevalence 0.2105 0.2206
Balanced Accuracy 0.8799 0.7944
since the tree is essentially similar to the previous two information gain trees the results that this tree shows is very close in performance to them as well.
The observed structure of the three decision trees seems to be the same and it can be summarized as follows:
Root Node - Experience Level: The initial attribute used for splitting the dataset at the root node is the “experience level.” This divides the tree into two main branches or subtrees:
Within the right subtree: In the right sub tree if the experience level is 4(EX) the tree will be divided based on “Company location”
Within the left subtree: In the left subtree it will divide the tree for both two experience levels 1(EN) and 2(MI) based on “employee residence” and when the “employee residence” is “North America” the tree will be further divided based on “salary currency” and when this attribute is equal to “USD” the division will be based on the “job title” attribute
The decision tree continues to select the most appropriate attributes for splitting at each node, progressively refining the decision process until it reaches the leaves, where final class labels are assigned to the instances.
rbind("10 Folds"=macro(infoGain10cm), "5 Folds"=macro(infoGain5cm), "3 Folds"=macro(infoGain3cm) )
Based on the provided sensitivity, specificity, precision, and accuracy values there isn’t a clear indication of the superiority of one fold over another for Information Gain model .we may need to consider additional factors or conduct further analysis to make a well-informed decision. as can be seen in the table the 10 folds has the best Specificity and Precision, meanwhile the 5 folds has the best Sensitivity and Accuracy.
The gain ratio evaluated with 10-fold cross-validation appears to has the best performance among all the decision tree models. This might be due to the fact that the gain ratio tends to favor unbalanced splits, where one partition is significantly smaller than others. In our dataset, if an attribute has a rare value, the gain ratio might prioritize splitting on this attribute despite the resulting unbalance.
Despite the superiority of the 10-fold, the performance metrics of all three fold sizes (3-fold, 5-fold, and 10-fold) using Gini index, gain ratio, and information gain are relatively similar, suggesting that all three measures are robust within the context of this dataset. A likely contributing factor to this performance consistency is the balanced distribution of class labels in the dataset. When classes are balanced, each splitting criterion is more or less equally likely to encounter informative splits, which serves to decrease performance variability across different splitting methods.
Data clustering is a process to partition data into groups or clusters,it is an unsupervised learning process, which is excuted without knowing the class label of the training data. The data in the same group “cluster” are similar to one another and different from data in other clusters. And for this data mining task We will utilize k-means clustering.
we will encode the rest of factor columns to transform them into numeric types before clustering, enabling meaningful distance calculations using kmeans and other formulas, and allowing for maximum flexibility in data processing and interpretation. we will also remove the class label from the dataset as clustering is an unsupervised learning process, and we will preserve this class label in an attribute for later use. lastly, we will scale all numeric attributes in the dataset so they will be standarized.
# view data
dataset3 <- dataset2
View(dataset3)
# Reserve the salary_in_usd (the class label) column in an attribute before removing it from the dataset for clustering
classLabel <- dataset3[, 5]
# Remove the class lable from the dataset
dataset3 <- dataset3[, -5]
# encoding job_title variable
dataset3$job_title = factor(dataset3$job_title, levels=c("Analyst", "Architect", "Engineer", "Leadership", "Consultant/Specialist","Cyber Security","Others" ), labels=c(4,1,2,5,3,6,7))
# encoding salary_currency variable
dataset3$salary_currency = factor(dataset3$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21))
# encoding employee_residence variable
dataset3$employee_residence = factor(dataset3$employee_residence, levels=c("North America","Latin America & Caribbean","Sub-Saharan Africa", "Europe & Central Asia","East Asia & Pacific","South Asia","Middle East & North Africa"), labels=c(1,2,3,4,5,6,7))
# encoding company_location variable
dataset3$company_location = factor(dataset3$company_location, levels=c("North America","Latin America & Caribbean","Sub-Saharan Africa", "Europe & Central Asia","East Asia & Pacific","South Asia","Middle East & North Africa", "AQ", "UM"), labels=c(1,2,3,4,5,6,7,8,9))
#Data types to be transformed into numeric types before clustering
#Transforming all non-numeric attributes to numeric type
dataset3$experience_level <- as.numeric(as.character(dataset3$experience_level))
dataset3$job_title <- as.numeric(as.character(dataset3$job_title))
dataset3$salary_currency <- as.numeric(as.character(dataset3$salary_currency))
dataset3$employee_residence <- as.numeric(as.character(dataset3$employee_residence))
dataset3$company_location <- as.numeric(as.character(dataset3$company_location))
dataset3$company_size <- as.numeric(as.character(dataset3$company_size))
# viwe the class of attributes to ensure they have transformed to numeric
sapply(dataset3, class)
work_year experience_level job_title
"numeric" "numeric" "numeric"
salary_currency employee_residence remote_ratio
"numeric" "numeric" "numeric"
company_location company_size
"numeric" "numeric"
#scale all attributes in the dataset so they would be standardized
dataset3 <- scale(dataset3)
After preprocessing the data, now we are ready to perform the clustering process, we will use the k-means clustering, it is a clustering method that aims to minimize the sum of squared distances between each data point and the centroid of its assigned cluster by iteratively updating cluster assignments and centroids.
We will choose 3 different numbers to perform the k-means clustering on, one of the numbers should be relatevily large, the second should be in the middle and the last should be small. This way we will cover the possible outcomes and clustering results.
Now we will apply Silhouette method to find the optimal number of clusters k, we will also plot a graph where x-axis represent the number of clusters and y-axis represent the average Silhouette coefficient
fviz_nbclust(dataset3, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
as seen by the graph, the number of clusters k that maximizes the average Silhouette coefficient is 2, so we will use it for clustering.
This method determines the number of clusters according to the turning point in a curve, the curve is plotted using the total within-cluster sum of square (WSS) as in y-axis , and No. clusters in x-axis
fviz_nbclust(dataset3, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
As shown, the number of clusters k that represents the turning point in the curve is 4, so we will use it for clustering.
Lastly, we will use k=3 since it acheives the second highest average Silhouette coefficient, and since it’s in the middle between 2 and 4 it will strike a balance between having too few clusters (k=2), and having several clusters (k=4), Thus, this choice will have an acceptable acuuracy.
In this section, we will perform k-means clustering and visualize its result using three different k’s that have been chosen beforehand, then we will compute WSS and Bcubed preceision and recall and average silhouette for each cluster as methods of evaluating clustering results.
#Use seed to guarantee replicability of random processes
set.seed(8953)
# run k-means clustering to find 2 clusters
kmeans.result <- kmeans(dataset3, 2)
# print the clusterng result
kmeans.result
K-means clustering with 2 clusters of sizes 938, 308
Cluster means:
work_year experience_level job_title salary_currency
1 0.1123321 0.07766291 -0.05252131 -0.3315729
2 -0.3421024 -0.23651887 0.15995128 1.0097901
employee_residence remote_ratio company_location company_size
1 -0.5368064 0.08024947 -0.5259331 0.0410645
2 1.6348196 -0.24439612 1.6017052 -0.1250601
Clustering vector:
[1] 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 2 1 2 1 1 1 1 1 1
[35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1
[69] 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1 1 2 1
[103] 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 2
[137] 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[171] 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1
[205] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[239] 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1
[273] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[307] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[341] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1
[375] 1 1 1 1 1 2 2 1 1 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1
[409] 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 2 1 2 1 1
[443] 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2
[477] 1 1 1 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 1 1 2 1 2 2 1 1 1 1 1 1 1 1 1 2
[511] 1 2 1 2 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1
[545] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1
[579] 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[613] 1 2 2 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[647] 1 1 1 1 1 1 1 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[681] 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1
[715] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 1 2 1 1 1 2
[749] 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1
[783] 1 2 2 2 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 2 1 1 1 1 2 1 1 2 1 1 1 1 2 1
[817] 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
[851] 1 1 1 2 1 1 1 1 2 1 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 1 2 1 1 2 1 1 1
[885] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[919] 2 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1
[953] 2 2 2 2 1 2 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 2 2 1 2
[987] 2 2 2 2 1 1 1 1 1 2 1 1 1 2
[ reached getOption("max.print") -- omitted 246 entries ]
Within cluster sum of squares by cluster:
[1] 4608.187 2679.470
(between_SS / total_SS = 26.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)
#average silhouette for each clusters
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset3))
fviz_silhouette(avg_sil)
#Within-cluster sum of squares wss
wss <- kmeans.result$tot.withinss
print(wss)
[1] 7287.657
#BCubed
kmeans_cluster <- c(kmeans.result$cluster)
ground_truth <- c(classLabel)
data <- data.frame(cluster = kmeans_cluster, label = ground_truth)
bcubed <- function(data) {
n <- nrow(data)
total_precesion <- 0
total_recall <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Number of objects in the same category and cluster
intersection <- sum(data$label[data$cluster == cluster] == label)
# Number of objects that are in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Number of objects that have the same category
total_same_category <- sum(data$label == label)
total_precesion <- total_precesion + intersection /total_same_cluster
total_recall <- total_recall + intersection / total_same_category
}
# compute avg precision and recall
precision <- total_precesion / n
recall <- total_recall / n
return(list(precision = precision, recall = recall)) }
# compute BCubed precision and recall
metrics <- bcubed(data)
precision <- metrics$precision
recall <- metrics$recall
# Print results
cat("BCubed Precision is:", precision, "\n")
BCubed Precision is: 0.2814606
cat("BCubed Recall is:", recall, "\n")
BCubed Recall is: 0.7072104
we can conclude from the graph and the results that the k=2 is the optimal k, since there is no overlapping between the two clusters, the data in a cluster are close “similar” to each other and dissimilar to data in the other cluster. Also, the recall is relatively high (0.71) and is the highest among the k’s chosen, the Precision is low (0.28) which could be duo to presence of outliers or sensitivity to Initial Centroid. We can also note that the WSS is 26.8%, which is an excellent percentage indicating a good compactness of clusters, and that objects in a cluster are similar to one another. Lastly, the average silhouette width is 0.34 which is considered high reflecting high intra-cluster similarity.
#Use seed to guarantee replicability of random processes
set.seed(8953)
# run k-means clustering to find 3 clusters
kmeans.result <- kmeans(dataset3, 3)
# print the clusterng result
kmeans.result
K-means clustering with 3 clusters of sizes 475, 302, 469
Cluster means:
work_year experience_level job_title salary_currency
1 -0.3742571 -0.4628902 -0.004526137 -0.3360410
2 -0.3447469 -0.2440022 0.169029312 1.0307121
3 0.6010356 0.6259307 -0.104257862 -0.3233595
employee_residence remote_ratio company_location company_size
1 -0.5399390 -0.4866784 -0.5233822 0.2231005
2 1.6375401 -0.2426204 1.6311505 -0.1312858
3 -0.5076036 0.6491335 -0.5202578 -0.1414167
Clustering vector:
[1] 1 1 1 1 2 3 3 3 2 2 1 3 3 1 3 3 3 1 1 2 2 2 2 1 1 2 3 2 3 3 1 3 1 1
[35] 1 1 3 3 3 3 1 3 3 1 3 3 1 3 3 3 2 1 1 1 3 1 1 2 3 2 3 3 2 2 1 3 3 3
[69] 3 3 3 3 3 3 3 2 1 1 2 3 3 3 3 3 3 3 3 3 2 3 3 2 1 2 3 3 3 3 3 3 2 1
[103] 3 3 3 3 3 3 3 1 3 3 3 2 2 1 1 1 1 3 1 2 2 2 2 2 3 3 3 3 3 3 1 3 3 2
[137] 2 2 2 3 3 1 1 3 3 3 1 1 3 3 3 3 3 3 3 2 3 3 1 1 1 1 3 3 3 1 1 3 3 3
[171] 3 3 3 1 1 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 2 2 2 3 3 3 3 1 1
[205] 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1
[239] 3 3 1 1 1 1 2 2 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 3 2 2 3 3 3 3 3 3
[273] 1 1 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 3 3 3
[307] 3 3 1 1 3 3 3 3 3 3 3 3 2 2 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[341] 3 3 3 3 3 3 1 1 1 1 3 3 2 3 2 1 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 1 3
[375] 3 3 1 3 3 2 2 1 3 2 3 3 3 2 2 3 3 2 2 1 3 1 3 3 3 2 2 3 1 1 3 1 3 3
[409] 3 3 1 3 3 1 3 1 1 2 2 3 3 3 1 3 3 3 3 1 3 1 3 1 2 3 2 1 2 2 3 2 3 3
[443] 3 1 2 1 3 2 3 1 1 1 1 3 3 1 1 2 1 3 3 3 3 2 1 3 1 3 1 3 1 1 1 2 2 2
[477] 1 1 1 3 2 2 2 1 1 3 2 3 1 2 2 3 2 2 3 1 2 3 2 2 3 1 1 1 1 3 3 3 3 2
[511] 1 2 3 2 3 1 3 2 1 3 2 3 3 3 2 1 3 1 3 3 2 2 2 2 3 3 3 3 1 3 3 1 1 3
[545] 1 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1 3 3 1 3 3 3 3 2 3 1 2 2 3 3 1 1 1 3
[579] 3 3 3 1 1 1 1 1 2 3 1 1 2 2 1 2 3 3 2 2 1 3 3 3 3 1 3 3 3 3 1 1 3 3
[613] 3 2 2 2 2 3 3 3 3 2 2 2 2 2 1 1 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 2
[647] 3 1 1 3 3 3 3 2 1 2 3 1 2 1 2 3 3 3 1 1 3 3 3 1 3 1 3 1 1 3 1 1 1 1
[681] 3 1 3 1 1 2 2 1 1 2 1 1 1 1 1 2 1 2 1 1 3 3 3 2 2 2 1 1 1 1 1 1 1 1
[715] 1 1 3 2 3 1 1 1 3 1 1 3 1 1 1 1 1 1 3 1 2 2 1 3 1 1 3 2 1 2 1 1 1 2
[749] 3 1 1 1 1 1 2 2 2 2 2 3 3 3 1 2 2 1 1 2 2 3 1 3 1 3 3 3 1 1 3 1 2 1
[783] 1 2 2 2 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 2 1 1 1 1 2 3 1 2 3 1 1 3 2 1
[817] 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 3 1 1 1 1 1 3 3 1 2 1 1 1 1 1 1 3 1 1
[851] 1 3 3 2 3 1 1 1 2 1 1 1 1 2 1 3 1 2 2 1 1 1 2 1 2 1 3 2 1 1 2 1 3 1
[885] 1 1 1 1 1 1 1 2 1 3 1 3 1 1 1 3 1 1 2 3 3 1 1 1 1 1 1 1 1 1 3 3 3 1
[919] 2 2 3 1 2 1 3 2 3 1 1 1 1 1 1 1 2 1 3 1 1 1 3 1 3 1 1 2 1 1 3 2 2 1
[953] 2 2 2 2 1 2 2 1 2 1 1 1 1 1 1 1 1 1 2 1 3 3 1 1 2 1 1 1 1 1 2 2 1 2
[987] 2 2 2 2 1 1 1 1 1 2 1 1 1 2
[ reached getOption("max.print") -- omitted 246 entries ]
Within cluster sum of squares by cluster:
[1] 2262.211 2615.908 1573.419
(between_SS / total_SS = 35.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)
#average silhouette for each clusters
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset3))
fviz_silhouette(avg_sil)
#Within-cluster sum of squares wss
wss <- kmeans.result$tot.withinss
print(wss)
[1] 6451.538
#BCubed
kmeans_cluster <- c(kmeans.result$cluster)
ground_truth <- c(classLabel)
data <- data.frame(cluster = kmeans_cluster, label = ground_truth)
bcubed <- function(data) {
n <- nrow(data)
total_precesion <- 0
total_recall <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Number of objects in the same category and cluster
intersection <- sum(data$label[data$cluster == cluster] == label)
# Number of objects that are in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Number of objects that have the same category
total_same_category <- sum(data$label == label)
total_precesion <- total_precesion + intersection /total_same_cluster
total_recall <- total_recall + intersection / total_same_category
}
# compute avg precision and recall
precision <- total_precesion / n
recall <- total_recall / n
return(list(precision = precision, recall = recall))
}
# compute BCubed precision and recall
metrics <- bcubed(data)
precision <- metrics$precision
recall <- metrics$recall
# Print results
cat("BCubed Precision is:", precision, "\n")
BCubed Precision is: 0.3059101
cat("BCubed Recall is:", recall, "\n")
BCubed Recall is: 0.4537324
we can conclude from the graph and the results where k=3 is that the performance is good but worse than k=2, because there is overlapping between clusters. In addition, the recall is relatively high (0.68), However, the Precision is low (0.28) which could be duo to presence of outliers or sensitivity to Initial Centroid. We can also note that the WSS is 32%, which is a good percentage indicating an intermidiate compactness of clusters, and that objects in a cluster are to some extent similar to one another. Lastly, the average silhouette width is 0.31 which is good reflecting a good intra-cluster similarity.
#Use seed to guarantee replicability of random processes
set.seed(8953)
# run k-means clustering to find 4 clusters
kmeans.result <- kmeans(dataset3, 4)
# print the clusterng result
kmeans.result
K-means clustering with 4 clusters of sizes 291, 259, 48, 648
Cluster means:
work_year experience_level job_title salary_currency
1 -0.2065990 -0.26057796 -0.07392709 -0.3742177
2 -0.2746520 -0.26852805 0.08713240 0.4713219
3 -0.7039739 -0.05919153 0.57502683 3.9452389
4 0.2547005 0.22873170 -0.04422191 -0.3125717
employee_residence remote_ratio company_location company_size
1 -0.5494700 -1.2790118 -0.5320764 0.03437756
2 1.6082647 -0.1776153 1.5917788 -0.13081536
3 1.7560475 -0.5190540 1.7138693 -0.07831572
4 -0.5261344 0.6838108 -0.5242318 0.04264886
Clustering vector:
[1] 1 1 1 1 2 4 4 4 2 2 1 4 4 1 4 4 4 4 4 2 2 2 2 1 1 2 4 2 4 4 4 4 1 1
[35] 4 1 4 4 4 4 1 4 4 1 4 4 1 4 4 4 2 1 1 1 1 1 1 3 4 2 4 4 2 2 1 4 4 4
[69] 4 4 4 4 4 4 4 2 4 4 2 4 4 4 4 4 4 4 4 4 3 4 4 3 1 2 4 4 4 4 4 4 2 4
[103] 4 4 4 4 4 4 4 4 4 4 4 2 2 4 4 4 4 4 4 2 2 2 2 2 1 4 4 4 4 4 4 4 4 2
[137] 2 2 2 4 4 1 1 4 4 4 4 1 4 4 4 4 4 4 4 2 4 4 4 4 1 1 4 4 4 1 1 4 4 4
[171] 4 4 4 1 1 4 4 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 1 1 2 2 2 2 4 4 4 4 1 1
[205] 4 4 4 4 4 4 4 4 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1
[239] 4 4 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 4 4 2 2 4 4 4 4 4 4
[273] 1 1 4 4 4 4 4 4 4 4 4 4 1 1 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 4 4 4 4
[307] 4 4 1 1 4 4 4 4 4 4 4 4 2 2 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[341] 4 4 4 4 4 4 1 1 1 1 4 4 2 4 2 4 4 4 4 4 4 4 4 4 4 1 1 2 2 4 4 4 1 4
[375] 4 4 4 4 4 2 2 1 4 2 4 4 4 2 2 4 4 2 2 4 4 1 4 4 4 2 2 4 1 1 4 1 4 4
[409] 4 4 1 4 4 1 4 4 4 2 2 4 4 4 1 4 4 4 4 1 4 1 4 1 2 4 2 1 2 2 4 2 4 4
[443] 4 1 2 1 4 2 4 1 1 1 1 4 4 1 1 2 4 4 4 4 4 3 4 4 4 4 4 4 1 1 1 2 2 3
[477] 1 1 1 4 2 2 2 1 4 4 2 4 1 2 2 4 3 2 4 1 3 4 2 2 4 4 4 1 4 4 4 4 4 2
[511] 4 2 4 2 4 4 4 2 4 4 2 4 4 4 2 1 4 4 4 4 2 2 3 2 4 4 4 4 1 4 4 4 4 4
[545] 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 2 4 4 2 2 4 4 1 1 1 4
[579] 4 4 4 4 1 1 4 1 2 4 4 1 2 2 4 2 4 4 3 2 4 4 4 4 4 4 4 4 4 4 4 1 4 2
[613] 4 2 2 3 2 1 4 4 4 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3
[647] 4 1 1 4 4 4 4 2 4 2 4 4 2 1 2 4 4 4 1 4 4 4 4 4 4 4 4 1 1 4 1 4 1 4
[681] 4 1 4 1 1 2 3 1 1 2 4 1 4 1 4 2 1 2 1 4 4 4 4 2 2 3 1 1 1 4 1 1 4 4
[715] 1 4 4 2 4 4 1 1 4 1 1 4 1 4 4 4 1 4 4 2 2 2 4 4 4 4 4 2 4 2 1 4 1 2
[749] 4 1 1 4 4 1 2 2 2 2 2 4 4 4 1 2 2 4 1 2 2 4 4 4 1 4 4 4 1 1 4 4 2 1
[783] 1 2 2 2 4 1 1 3 1 3 1 1 1 3 1 4 1 4 2 2 1 4 1 1 3 4 1 2 4 4 4 4 2 4
[817] 1 4 4 4 2 1 4 2 4 4 2 1 1 1 1 4 4 4 1 1 1 4 4 4 3 1 4 1 1 1 4 4 1 4
[851] 4 4 4 3 4 1 1 1 2 1 1 1 2 2 4 4 1 3 3 4 1 1 2 1 2 4 4 2 1 1 2 1 4 4
[885] 4 1 1 1 1 4 4 2 1 4 4 4 4 4 1 4 1 4 2 4 4 4 4 1 1 4 1 4 4 1 4 4 4 4
[919] 2 2 4 1 2 1 4 3 4 4 1 1 4 4 4 4 2 4 4 1 4 4 4 4 4 1 1 2 1 4 4 2 2 1
[953] 2 2 3 3 4 2 3 4 2 4 4 1 1 1 4 1 1 1 2 1 4 4 1 4 2 1 1 4 1 1 2 2 4 3
[987] 3 2 2 3 1 1 1 1 1 2 4 4 4 2
[ reached getOption("max.print") -- omitted 246 entries ]
Within cluster sum of squares by cluster:
[1] 1195.7362 1730.4906 422.4286 2562.3908
(between_SS / total_SS = 40.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)
#average silhouette for each clusters
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset3))
fviz_silhouette(avg_sil)
#Within-cluster sum of squares wss
wss <- kmeans.result$tot.withinss
print(wss)
[1] 5911.046
#BCubed
kmeans_cluster <- c(kmeans.result$cluster)
ground_truth <- c(classLabel)
data <- data.frame(cluster = kmeans_cluster, label = ground_truth)
bcubed <- function(data) {
n <- nrow(data)
total_precesion <- 0
total_recall <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Number of objects in the same category and cluster
intersection <- sum(data$label[data$cluster == cluster] == label)
# Number of objects in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Number of objects that have the same category
total_same_category <- sum(data$label == label)
# Calculate precision and recall for the current item and add them to the sums
total_precesion <- total_precesion + intersection /total_same_cluster
total_recall <- total_recall + intersection / total_same_category
}
# Compute avg precision and recall
precision <- total_precesion / n
recall <- total_recall / n
return(list(precision = precision, recall = recall))
}
# compute BCubed precision and recall
metrics <- bcubed(data)
precision <- metrics$precision
recall <- metrics$recall
# Print results
cat("BCubed Precision is:", precision, "\n")
BCubed Precision is: 0.2886446
cat("BCubed Recall is:", recall, "\n")
BCubed Recall is: 0.4398836
we can conclude from the graph and the results where k=4 is that the performance is worse than k=2 and k=3, because there is a noticeable overlapping between clusters. Also, the clusers’ space is pretty wide which results in a large distance between objects in the same cluster. In addition, the recall is relatively low (0.40) which might be a result of the overlapping and large distances between data objects. Furthermore, the Precision is low (0.28) which could be duo to presence of outliers or sensitivity to Initial Centroid. We can also note that the WSS is 40.7%, which is an acceptable percentage indicating a lower compactness of clusters. Lastly, the average silhouette width is 0.2 which is low reflecting high inter-cluster similarity.